Monte-Carlo control

try:
    import google.colab
    IN_COLAB = True
except:
    IN_COLAB = False

if IN_COLAB:
    !pip install -U gymnasium pygame moviepy
    !pip install gymnasium[box2d]
import numpy as np
rng = np.random.default_rng()
import matplotlib.pyplot as plt
import os

import gymnasium as gym
print("gym version:", gym.__version__)

from moviepy.editor import ImageSequenceClip, ipython_display

class GymRecorder(object):
    """
    Simple wrapper over moviepy to generate a .gif with the frames of a gym environment.
    
    The environment must have the render_mode `rgb_array_list`.
    """
    def __init__(self, env):
        self.env = env
        self._frames = []

    def record(self, frames):
        "To be called at the end of an episode."
        for frame in frames:
            self._frames.append(np.array(frame))

    def make_video(self, filename):
        "Generates the gif video."
        directory = os.path.dirname(os.path.abspath(filename))
        if not os.path.exists(directory):
            os.mkdir(directory)
        self.clip = ImageSequenceClip(list(self._frames), fps=self.env.metadata["render_fps"])
        self.clip.write_gif(filename, fps=self.env.metadata["render_fps"], loop=0)
        del self._frames
        self._frames = []
gym version: 0.26.3

The taxi environment

In this exercise, we are going to apply on-policy Monte-Carlo control on the Taxi environment available in gym:

https://gymnasium.farama.org/environments/toy_text/taxi/

Let’s create the environment in ansi mode, initialize it, and render the first state:

env = gym.make("Taxi-v3", render_mode='ansi')
state, info = env.reset()
print(state)
print(env.render())
374
+---------+
|R: | : :G|
| : | : : |
| : : : : |
| | : | : |
|Y| : |B: |
+---------+

The agent is the black square. It can move up, down, left or right if there is no wall (the pipes and dashes). Its goal is to pick clients at the blue location and drop them off at the purple location. These locations are fixed (R, G, B, Y), but which one is the pick-up location and which one is the drop-off destination changes between each episode.

Q: Re-run the previous cell multiple times to observe the diversity of initial states.

The following cell prints the action space of the environment:

print("Action Space:", env.action_space)
print("Number of actions:", env.action_space.n)
Action Space: Discrete(6)
Number of actions: 6

There are 6 discrete actions: south, north, east, west, pickup, dropoff.

Let’s now look at the observation space (state space):

print("State Space:", env.observation_space)
print("Number of states:", env.observation_space.n)
State Space: Discrete(500)
Number of states: 500

There are 500 discrete states. What are they?

  • The taxi can be anywhere in the 5x5 grid, giving 25 different locations.
  • The passenger can be at any of the four locations R, G, B, Y or in the taxi: 5 values.
  • The destination can be any of the four locations: 4 values.

This gives indeed 25x5x4 = 500 different combinations.

The internal representation of a state is a number between 0 and 499. You can use the encode and decode methods of the environment to relate it to the state variables.

state = env.encode(2, 1, 1, 0) # (taxi row, taxi column, passenger index, destination index)
print("State:", state)

state = env.decode(328) 
print("State:", list(state))
State: 224
State: [3, 1, 2, 0]

The reward function is simple:

  • r = 20 when delivering the client at the correct location.
  • r = -10 when picking or dropping a client illegally (picking where there is no client, dropping a client somewhere else, etc)
  • r = -1 for all other transitions in order to incent the agent to be as fast as possible.

The actions pickup and dropoff are very dangerous: take them at the wrong time and your return will be very low. The navigation actions are less critical.

Depending on the initial state, the taxi will need at least 10 steps to deliver the client, so the maximal return you can expect is around 10 (+20 for the success, -1 for all the steps).

The task is episodic: if you have not delivered the client within 200 steps, the episode stops (no particular reward).

Random agent

Let’s now define a random agent that just samples the action space.

Q: Modify the random agent of last time, so that it accepts the GymRecorder that generates the .gif file.

def train(self, nb_episodes, recorder=None):

The environment should be started in ‘rgb_array_list’ mode, not ‘ansi’. The game looks different but has the same rules.

env = gym.make("Taxi-v3", render_mode='rgb_array_list')
recorder = GymRecorder(env)

As episodes in Taxi can be quite long, only the last episode should be recorded:

if recorder is not None and episode == nb_episodes -1:
    recorder.record(self.env.render())

Perform 10 episodes, plot the obtained returns and vidualize the last episode.

class RandomAgent:
    """
    Random agent exploring uniformly the environment.
    """
    
    def __init__(self, env):
        self.env = env
    
    def act(self, state):
        "Returns a random action by sampling the action space."
        return self.env.action_space.sample()
    
    def update(self, state, action, reward, next_state):
        "Updates the agent using the transition (s, a, r, s')."
        pass
    
    def train(self, nb_episodes, recorder=None):
        "Runs the agent on the environment for nb_episodes. Returns the list of obtained rewards."
        # List of returns
        returns = []

        for episode in range(nb_episodes):

            # Sample the initial state
            state, info = self.env.reset()

            return_episode = 0.0
            done = False
            while not done:

                # Select an action randomly
                action = self.env.action_space.sample()
                
                # Sample a single transition
                next_state, reward, terminal, truncated, info = self.env.step(action)
                
                # Go in the next state
                state = next_state

                # Update return
                return_episode += reward

                # End of the episode
                done = terminal or truncated

            # Record at the end of the episode
            if recorder is not None and episode == nb_episodes -1:
                recorder.record(self.env.render())
            
            # Append return
            returns.append(return_episode)

        return returns
# Create the environment
env = gym.make("Taxi-v3", render_mode='rgb_array_list')
recorder = GymRecorder(env)

# Create the agent
agent = RandomAgent(env)

# Train for 10 episodes
returns = agent.train(10, recorder)

plt.figure(figsize=(15, 6))
plt.plot(returns)
plt.xlabel("Episodes")
plt.ylabel("Return")
plt.show()

video = "videos/taxi.gif"
recorder.make_video(video)
ipython_display(video)

MoviePy - Building file videos/taxi.gif with imageio.
                                                              

Q: What do you think of the returns obtained by the random agent? Conclude on the difficulty of the task.

A: The optimal returns are around 10, but the obtained returns with a random policy are very negative (-700, mostly due to many illegal pickups or dropoffs). One can expect a very huge variance of the returns, so learning will be slow.

On-policy Monte-Carlo control

Now let’s apply on-policy MC control on the Taxi environment. As a reminder, here the meta-algorithm:

  • while True:

    1. Generate an episode \tau = (s_0, a_0, r_1, \ldots, s_T) using the current stochastic policy \pi.

    2. For each state-action pair (s_t, a_t) in the episode, update the estimated Q-value:

    Q(s_t, a_t) = Q(s_t, a_t) + \alpha \, (R_t - Q(s_t, a_t))

    1. For each state s_t in the episode, improve the policy (e.g. \epsilon-greedy):

    \pi(s_t, a) = \begin{cases} 1 - \epsilon \; \text{if} \; a = a^* \\ \frac{\epsilon}{|\mathcal{A(s_t)}-1|} \; \text{otherwise.} \\ \end{cases}

In practice, we will need:

  • a Q-table storing the estimated Q-value of each state-action pair: its size will be (500, 6).

  • an \epsilon-greedy action selection to select actions in the current state.

  • an learning mechanism allowing to update the Q-value of all state-action pairs encountered in the episode.

Q: Create a MonteCarloAgent class implementing on-policy MC for the Taxi environment.

Use \gamma = 0.9, \epsilon = 0.1 and \alpha=0.01 (pass these parameters to the constructor of the agent and store them). Train the agent for 20000 episodes (yes, 20000… Start with one episode to debug everything and then launch the simulation. It should take around one minute). Save the return of each episode in a list, as well as the number of steps of the episode, and plot them in the end.

The environment should be created without rendering (env = gym.make("Taxi-v3"), no recorder).

Implementing the action selection mechanism should not be a problem, it is the same as for bandits. Little trick (not obligatory): you can implement \epsilon-greedy as:

action = self.Q[state, :].argmax()
if rng.random() < epsilon:
    action = self.env.action_space.sample()

This is not exactly \epsilon-greedy, as env.action_space.sample() may select the greedy action again. In practice it does not matter, it only changes the meaning of \epsilon, but the action selection stays similar. It is better to rely on env.action_space.sample() for the exploration, as some Gym problem work better with a normal distribution for the exploration than with uniform (e.g. continuous problems).

Do not select the greedy action with self.Q[state, :].argmax() but rng.random.choice(np.where(self.Q[state, :] == self.Q[state, :].max())[0]): at the beginning of learning, where the Q-values are all 0, you would otherwise always take the first action (south).

The update() method should take a complete episode as argument, using a list of (state, action, reward) transitions. It should be called at the end of an episode only, not after every step.

A bit tricky is the calculation of the returns for each visited state. The naive approach would look like:

T = len(episode)
for t in range(T):
    state, action, reward = episode[t]
    return_state = 0.0
    for k in range(t, T): # rewards coming after t
        next_state, next_action, next_reward = episode[k]
        return_state += gamma**k * reward
    self.Q[state, action] += alpha * (return_state - self.Q[state, action])

The double for loop can be computationally expensive for long episodes (complexity T log T). It is much more efficient to iterate backwards on the episode, starting from the last transition and iterating until the first one, and using the fact that:

R_{t} = r_{t+1} + \gamma \, R_{t+1}

The terminal state s_T has a return of 0 by definition. The last transition s_{T-1} \rightarrow s_{T} has therefore a return of R_{T-1} = r_T. The transition before that has a return of R_{T-2} = r_{T-1} + \gamma \, R_{T-1}, and so on. You can then compute the returns of each action taken in the episode (and update its Q-value) in linear time.

To iterate backwards over the list of transitions, use the reversed() operator:

l = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]

for a in reversed(l):
    print(a)
class MonteCarloAgent:
    """
    Online Monte-Carlo agent.
    """
    
    def __init__(self, env, gamma, epsilon, alpha):
        """
        :param env: gym-like environment
        :param gamma: discount factor
        :param epsilon: exploration parameter
        :param alpha: learning rate
        """
        self.env = env
        self.gamma = gamma
        self.epsilon = epsilon
        self.alpha = alpha
        
        # Q_table
        self.Q = np.zeros([self.env.observation_space.n, self.env.action_space.n])
    
    def act(self, state):
        "Returns an action using epsilon-greedy action selection."
        
        action = rng.choice(np.where(self.Q[state, :] == self.Q[state, :].max())[0])
        
        if rng.random() < self.epsilon:
            action = self.env.action_space.sample() 
        
        return action
    
    def update(self, episode):
        "Updates the agent using a complete episode."
        # Terminal states have a return of 0
        return_episode = 0.0
        
        # Iterate backwards over the episode
        for state, action, reward in reversed(episode):
            
            # Compute the return
            return_episode = reward + self.gamma * return_episode
            
            # Update the Q-value
            self.Q[state, action] += self.alpha * (return_episode - self.Q[state, action])
            
    
    def train(self, nb_episodes, recorder=False):
        """
        Runs the agent on the environment for nb_episodes. Returns the list of obtained returns and the number of steps.
        """

        # Returns and steps
        returns = []
        steps = []

        # Fixed number of episodes
        for episode in range(nb_episodes):

            # Reset
            state, info = self.env.reset()
            return_episode = 0.
            nb_steps = 0

            # Store transitions
            transitions = []

            # Sample the episode
            done = False
            while not done:

                # Select an action 
                action = self.act(state)

                # Perform the action
                next_state, reward, terminal, truncated, info = self.env.step(action)

                # Store the transition
                transitions.append([state, action, reward])

                # Go in the next state
                state = next_state

                # Terminal state
                done = terminal or truncated

                # Increment time
                nb_steps += 1
                return_episode += reward

            # Update the Monte Carlo agent after the episode is completed
            self.update(transitions)    

            # Store info
            returns.append(return_episode)
            steps.append(nb_steps)
            
            
        return returns, steps
# Parameters
gamma = 0.9
epsilon = 0.1
alpha = 0.01
nb_episodes = 20000

# Create the environment
env = gym.make("Taxi-v3")

# Create the agent
agent = MonteCarloAgent(env, gamma, epsilon, alpha)

# Train the agent 
returns, steps = agent.train(nb_episodes)


plt.figure(figsize=(15, 6))
plt.subplot(121)
plt.plot(returns)
plt.xlabel("Episodes")
plt.ylabel("Returns")
plt.subplot(122)
plt.plot(steps)
plt.xlabel("Episodes")
plt.ylabel("steps")
plt.show()

As you may observe, the returns have a huge variance due to the exploration, what makes the plot quite ugly and unreadable. The following function allows to smooth the returns using a sliding average over the last N epochs. Note that the first values will be off.

def running_average(x, N):
    kernel = np.ones(N) / N
    return np.convolve(x, kernel, mode='same')

Q: Plot the returns and steps, as well as their sliding average. Comment on the influence of exploration.

plt.figure(figsize=(15, 6))
plt.subplot(121)
plt.plot(returns)
plt.plot(running_average(returns, 1000))
plt.xlabel("Episodes")
plt.ylabel("Returns")
plt.subplot(122)
plt.plot(steps)
plt.plot(running_average(steps, 1000))
plt.xlabel("Episodes")
plt.ylabel("steps")
plt.show()

Q: Extend the MonteCarloAgent class with a test method that performs a single episode on the environment without exploration, optionally records the episode but does not learn.

class MonteCarloAgentTest (MonteCarloAgent):
    """
    Online Monte-Carlo agent with a test method.
    """

    def test(self, recorder=None):
        # ...

In the test method, backup the previous value of epsilon in a temporary variable and reset it at the end of the episode. Have the method return the undiscounted return of the episode, as well as the number of steps until termination.

Perform 1000 test episodes without rendering and report the mean return over these 1000 episodes as the final performance of your agent.

Tip: To avoid re-training the agent, simply transfer the Q-table from the previous agent:

test_agent = MonteCarloAgentTest(env, gamma, epsilon, alpha)
test_agent.Q = agent.Q

return_episode, nb_steps = test_agent.test()
class MonteCarloAgentTest (MonteCarloAgent):
    """
    Online Monte-Carlo agent with a test method.
    """

    def test(self, recorder=None):
        "Performs a test episode without exploration."
        # Set epsilon to 0
        previous_epsilon = self.epsilon
        self.epsilon = 0.0
        
        # Reset
        state, info = self.env.reset()
        done = False
        nb_steps = 0
        return_episode= 0

        # Sample the episode
        while not done:
            action = self.act(state)
            next_state, reward, terminal, truncated, info = self.env.step(action)
            return_episode += reward
            state = next_state
            done = terminal or truncated
            nb_steps += 1

        if recorder is not None:
            recorder.record(self.env.render())

        # Restore epsilon
        self.epsilon = previous_epsilon
            
        return return_episode, nb_steps
# Parameters
gamma = 0.9
epsilon = 0.1
alpha = 0.01
nb_episodes = 20000

# Create the environment
env = gym.make("Taxi-v3")

# Create the agent
test_agent = MonteCarloAgentTest(env, gamma, epsilon, alpha)
test_agent.Q = agent.Q


# Test the agent for 1000 episodes
test_returns = []
test_steps = []
for episode in range(1000):
    return_episode, nb_steps = test_agent.test()
    test_returns.append(return_episode)
    test_steps.append(nb_steps)
print("Test performance", np.mean(test_returns))

plt.figure(figsize=(15, 6))
plt.subplot(121)
plt.hist(test_returns)
plt.xlabel("Returns")
plt.subplot(122)
plt.hist(test_steps)
plt.xlabel("Number of steps")
plt.show()
Test performance 5.965

Q: Visualize one episode after training. The environment used for training had no render mode, but you can always create a new environment and set it in the agent:

env = gym.make("Taxi-v3", render_mode="human") # or rgb_array_list
test_agent.env = env
env = gym.make("Taxi-v3", render_mode="rgb_array_list")
recorder = GymRecorder(env)
test_agent.env = env

return_episode, nb_steps = test_agent.test(recorder)

video = "videos/taxi-trained.gif"
recorder.make_video(video)
ipython_display(video, loop=1, autoplay=1)
MoviePy - Building file videos/taxi-trained.gif with imageio.
                                                            

A: The agent successfully picks and delivers the client at the correct location. The path is not always the shortest one (especially in the vast area in the top right), but that is fine.

Experiments

Early stopping

Q: Train the agent for the smallest number of episodes where the returns seem to have stabilized (e.g. 2000 episodes). Test the agent. Does it work? Why?

# Parameters
gamma = 0.9
epsilon = 0.1
alpha = 0.01
nb_episodes = 2000

# Create the environment
env = gym.make("Taxi-v3")

# Create the agent
agent = MonteCarloAgentTest(env, gamma, epsilon, alpha)

# Train the agent
returns, steps = agent.train(nb_episodes)

# Plot training returns
plt.figure(figsize=(15, 6))
plt.subplot(121)
plt.plot(returns)
plt.plot(running_average(returns, 100))
plt.xlabel("Episodes")
plt.ylabel("Returns")
plt.subplot(122)
plt.plot(steps)
plt.plot(running_average(steps, 100))
plt.xlabel("Episodes")
plt.ylabel("steps")
plt.show()

# Test the agent for 1000 episodes
test_returns = []
test_steps = []
for episode in range(1000):
    return_episode, nb_steps = agent.test()
    test_returns.append(return_episode)
    test_steps.append(nb_steps)
print("Test performance", np.mean(test_returns))

plt.figure(figsize=(15, 6))
plt.subplot(121)
plt.hist(test_returns)
plt.xlabel("Returns")
plt.subplot(122)
plt.hist(test_steps)
plt.xlabel("Number of steps")
plt.show()
Test performance -122.693

A: Although the returns are already between -20 and +10, as after 50000 episodes, the greedy policy is not optimal as some states still have a very bad policy (the agent sometimes goes back and forth between two states) or just a bad one (not optimal path). The negative returns at the end of learning are due to the exploration: the agent performs illegal pickups or dropoffs. Early in learning, this is due to bad estimates of the Q-values. The return of an episode is a bad estimate of the performance: it report both the exploration and the exploitation. Unfortunately, it is the only one we have…

Discount rate

Q: Change the value of the discount factor \gamma. As the task is episodic (maximum 200 steps), try a discount rate of 1. What happens? Conclude.

# Parameters
gamma = 1.0
epsilon = 0.1
alpha = 0.01
nb_episodes = 20000

# Create the environment
env = gym.make("Taxi-v3")

# Create the agent
agent = MonteCarloAgentTest(env, gamma, epsilon, alpha)

# Train the agent
returns, steps = agent.train(nb_episodes)

# Plot training returns
plt.figure(figsize=(15, 6))
plt.subplot(121)
plt.plot(returns)
plt.plot(running_average(returns, 100))
plt.xlabel("Episodes")
plt.ylabel("Returns")
plt.subplot(122)
plt.plot(steps)
plt.plot(running_average(steps, 100))
plt.xlabel("Episodes")
plt.ylabel("steps")
plt.show()

# Test the agent for 1000 episodes
test_returns = []
test_steps = []
for episode in range(1000):
    return_episode, nb_steps = agent.test()
    test_returns.append(return_episode)
    test_steps.append(nb_steps)
print("Test performance", np.mean(test_returns))

plt.figure(figsize=(15, 6))
plt.subplot(121)
plt.hist(test_returns)
plt.xlabel("Returns")
plt.subplot(122)
plt.hist(test_steps)
plt.xlabel("Number of steps")
plt.show()

Test performance -65.434

A: Rather tricky question… With a discount factor of 1, the agent does not converge as fast as with gamma = 0.9. This is due to the variance of the returns: imagine your episode is optimal all along, but at the last moment, the agent performs an illegal dropoff action. The undiscounted return of the episode will be negative and all actions taken during the episode will be punished, although only the last one is responsible for the bad return.

Using a discount factor < 1 allows the first actions to stop caring about the final rewards, as they are discounted by \gamma^T, which is very small. Take-home message: even if your task is episodic, use \gamma < 1.

Learning rate

Q: Vary the learning rate alpha. What happens?

# Parameters
gamma = 0.9
epsilon = 0.1
alpha = 0.5
nb_episodes = 20000

# Create the environment
env = gym.make("Taxi-v3")

# Create the agent
agent = MonteCarloAgentTest(env, gamma, epsilon, alpha)

# Train the agent
returns, steps = agent.train(nb_episodes)

# Plot training returns
plt.figure(figsize=(15, 6))
plt.subplot(121)
plt.plot(returns)
plt.plot(running_average(returns, 100))
plt.xlabel("Episodes")
plt.ylabel("Returns")
plt.subplot(122)
plt.plot(steps)
plt.plot(running_average(steps, 100))
plt.xlabel("Episodes")
plt.ylabel("steps")
plt.show()

# Test the agent for 1000 episodes
test_returns = []
test_steps = []
for episode in range(1000):
    return_episode, nb_steps = agent.test()
    test_returns.append(return_episode)
    test_steps.append(nb_steps)
print("Test performance", np.mean(test_returns))

plt.figure(figsize=(15, 6))
plt.subplot(121)
plt.hist(test_returns)
plt.xlabel("Returns")
plt.subplot(122)
plt.hist(test_steps)
plt.xlabel("Number of steps")
plt.show()
Test performance -173.495

A: If the learning rate is too high (0.5), the network does not converge and becomes unstable, as updates “erases” very quickly the previous values, what is bad given the high variance of the returns. If the learning rate is too low, learning takes forever. Classical machine learning problem… 0.01 works actually quite well.

Exploration parameter

Q: Vary the exploration parameter epsilon and observe its impact on learning.

# Parameters
gamma = 0.9
epsilon = 0.01 # or 0.3, etc.
alpha = 0.01
nb_episodes = 20000

# Create the environment
env = gym.make("Taxi-v3")

# Create the agent
agent = MonteCarloAgentTest(env, gamma, epsilon, alpha)

# Train the agent
returns, steps = agent.train(nb_episodes)

# Plot training returns
plt.figure(figsize=(15, 6))
plt.subplot(121)
plt.plot(returns)
plt.plot(running_average(returns, 100))
plt.xlabel("Episodes")
plt.ylabel("Returns")
plt.subplot(122)
plt.plot(steps)
plt.plot(running_average(steps, 100))
plt.xlabel("Episodes")
plt.ylabel("steps")
plt.show()

# Test the agent for 1000 episodes
test_returns = []
test_steps = []
for episode in range(1000):
    return_episode, nb_steps = agent.test()
    test_returns.append(return_episode)
    test_steps.append(nb_steps)
print("Test performance", np.mean(test_returns))

plt.figure(figsize=(15, 6))
plt.subplot(121)
plt.hist(test_returns)
plt.xlabel("Returns")
plt.subplot(122)
plt.hist(test_steps)
plt.xlabel("Number of steps")
plt.show()
Test performance -156.141

A: The exploration should be at least 0.1 in order to find the optimal policy. 0.2 or 0.3 find a good policy faster, but explore too much at the end of learning.

Exploration scheduling

Even with a good learning rate (0.01) and a discount factor of 0.9, the exploration parameter as a huge impact on the performance: too low and the agent does not find the optimal policy, too high and the agent is inefficient at the end of learning.

Q: Implement scheduling for epsilon. You can use exponential scheduling as in the bandits exercise:

\epsilon = \epsilon \times (1 - \epsilon_\text{decay})

at the end of each episode, with \epsilon_\text{decay} being a small decay parameter (1e-5 or so).

Find a correct value for \epsilon_\text{decay}. Do not hesitate to fine-tune alpha at the same time.

Tip: Prepare and visualize the scheduling in a different cell, and use the initial value of \epsilon and \epsilon_\text{decay} that seem to make sense.

class DecayMonteCarloAgent (MonteCarloAgentTest):
    """
    Monte-Carlo agent with decay of the exploration parameter.
    """
    
    def __init__(self, env, gamma, epsilon, decay_epsilon, alpha):
        """
        :param env: gym-like environment
        :param gamma: discount factor
        :param epsilon: exploration parameter
        :param decay_epsilon: exploration decay parameter
        :param alpha: learning rate
        """
        self.decay_epsilon = decay_epsilon
        
        super().__init__(env, gamma, epsilon, alpha)
    
    
    def train(self, nb_episodes, recorder=None):
        "Runs the agent on the environment for nb_episodes. Returns the list of obtained returns and steps."

        # Returns
        returns = []
        steps = []

        # Fixed number of episodes
        for episode in range(nb_episodes):

            # Reset
            state, info = self.env.reset()
            done = False
            return_episode = 0.0
            nb_steps = 0

            # Store transitions
            transitions = []

            # Sample the episode
            while not done:

                # Select an action 
                action = self.act(state)

                # Perform the action
                next_state, reward, terminal, truncated, info = self.env.step(action)

                # Store the transition
                transitions.append([state, action, reward])

                # Go in the next state
                state = next_state

                # Increment time
                nb_steps += 1
                return_episode += reward
                
                # Terminal state
                done = terminal or truncated

            # Update the Monte Carlo agent after the episode is completed
            self.update(transitions)   
            
            # Decay epsilon
            self.epsilon = self.epsilon * (1 - self.decay_epsilon)

            # Store info
            returns.append(return_episode)
            steps.append(nb_steps)
            
            
        return returns, steps
# Parameters
gamma = 0.9
epsilon = 0.4
decay_epsilon = 1e-4
alpha = 0.05
nb_episodes = 20000

# Create the environment
env = gym.make("Taxi-v3")

# Create the agent
agent = DecayMonteCarloAgent(env, gamma, epsilon, decay_epsilon, alpha)

# Train the agent
returns, steps = agent.train(nb_episodes)

# Plot training returns
plt.figure(figsize=(15, 6))
plt.subplot(121)
plt.plot(returns)
plt.plot(running_average(returns, 100))
plt.xlabel("Episodes")
plt.ylabel("Returns")
plt.subplot(122)
plt.plot(steps)
plt.plot(running_average(steps, 100))
plt.xlabel("Episodes")
plt.ylabel("steps")
plt.show()

# Test the agent for 1000 episodes
test_returns = []
test_steps = []
for episode in range(1000):
    return_episode, nb_steps = agent.test()
    test_returns.append(return_episode)
    test_steps.append(nb_steps)
print("Test performance", np.mean(test_returns))

plt.figure(figsize=(15, 6))
plt.subplot(121)
plt.hist(test_returns)
plt.xlabel("Returns")
plt.subplot(122)
plt.hist(test_steps)
plt.xlabel("Number of steps")
plt.show()
Test performance 7.401

A: As seen with bandits, decaying the exploration parameter with the right schedule improves very significantly the speed of learning and the optimality.